15 research outputs found

    Social Scene Understanding: End-to-End Multi-Person Action Localization and Collective Activity Recognition

    We present a unified framework for understanding human social behaviors in raw image sequences. Our model jointly detects multiple individuals, infers their social actions, and estimates the collective activity with a single feed-forward pass through a neural network. We propose a single architecture that does not rely on external detection algorithms but rather is trained end-to-end to generate dense proposal maps that are refined via a novel inference scheme. Temporal consistency is handled via a person-level matching Recurrent Neural Network. The complete model takes as input a sequence of frames and outputs detections along with estimates of individual actions and collective activities. We demonstrate state-of-the-art performance of our algorithm on multiple publicly available benchmarks.

    Variational Methods for Human Modeling

    A large part of computer vision research is devoted to building models and algorithms aimed at understanding human appearance and behaviour from images and videos. Ultimately, we want to build automated systems that are at least as capable as people when it comes to interpreting humans. Most of the tasks that we want these systems to solve can be posed as a problem of inference in probabilistic models. Although probabilistic inference in general is a very hard problem of its own, there exists a very powerful class of inference algorithms, variational inference, which allows us to build efficient solutions for a wide range of problems. In this thesis, we consider a variety of computer vision problems targeted at modeling human appearance and behaviour, including detection, activity recognition, semantic segmentation and facial geometry modeling. For each of those problems, we develop novel methods that use variational inference to improve the capabilities of the existing systems. First, we introduce a novel method for detecting multiple potentially occluded people in depth images, which we call DPOM. Unlike many other approaches, our method does probabilistic reasoning jointly, and thus allows us to propagate knowledge about one part of the image evidence to reason about the rest. This is particularly important in crowded scenes involving many people, since it helps to handle ambiguous situations resulting from severe occlusions. We demonstrate that our approach outperforms existing methods on multiple datasets. Second, we develop a new algorithm for variational inference that works for a large class of probabilistic models, which includes, among others, DPOM and some of the state-of-the-art models for semantic segmentation. We provide a formal proof that our method converges, and demonstrate experimentally that it brings better performance than the state of the art on several real-world tasks, including semantic segmentation and people detection. Importantly, we show that parallel variational inference in discrete random fields can be seen as a special case of proximal gradient descent, which allows us to benefit from many of the advances in gradient-based optimization. Third, we propose a unified framework for multi-human scene understanding which simultaneously solves three tasks: multi-person detection, individual action recognition and collective activity recognition. Within our framework, we introduce a novel multi-person detection scheme that relies on variational inference and jointly refines detection hypotheses instead of relying on suboptimal post-processing. Ultimately, our model takes as input a frame sequence and produces a comprehensive description of the scene. We experimentally demonstrate that our method outperforms the state of the art. Fourth, we propose a new approach for learning facial geometry with deep probabilistic models and variational methods. Our model is based on a variational autoencoder with multiple sets of hidden variables that capture various levels of deformations, ranging from global to local, high-frequency ones. We experimentally demonstrate the power of the model on a variety of fitting tasks. Our model is completely data-driven and can be learned from a relatively small number of individuals.
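    The link between parallel variational inference and gradient-based optimization is easiest to see in code. The following is a minimal sketch of synchronous (parallel) mean-field updates in a discrete pairwise random field, the class of inference the thesis analyzes; the tiny chain model, the shared compatibility matrix and the fixed iteration count are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def parallel_mean_field(unary, pairwise, neighbors, n_iters=20):
    """Parallel (synchronous) mean-field updates for a discrete pairwise MRF.

    unary:     (N, L) unary log-potentials for N nodes and L labels.
    pairwise:  (L, L) label-compatibility matrix shared by all edges.
    neighbors: list of neighbor-index lists, one per node.
    Returns the (N, L) matrix of approximate marginals q.
    """
    n_nodes, n_labels = unary.shape
    q = np.full((n_nodes, n_labels), 1.0 / n_labels)  # uniform initialization
    for _ in range(n_iters):
        # Expected pairwise potential from each neighbor under the current q.
        msg = np.zeros_like(unary)
        for i, nbrs in enumerate(neighbors):
            for j in nbrs:
                msg[i] += q[j] @ pairwise.T
        # All nodes are updated simultaneously: the "parallel" scheme.
        logits = unary + msg
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q

# Tiny 3-node chain with 2 labels, purely illustrative.
unary = np.array([[2.0, 0.0], [0.1, 0.0], [0.0, 2.0]])
pairwise = np.array([[0.5, -0.5], [-0.5, 0.5]])   # favors agreeing labels
neighbors = [[1], [0, 2], [1]]
print(parallel_mean_field(unary, pairwise, neighbors))
```

    In this naive synchronous form the updates can oscillate; viewing each sweep as a proximal gradient step on the variational objective is what lets the thesis reason about convergence and reuse advances from gradient-based optimization.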

    NPC: Neural Point Characters from Video

    High-fidelity human 3D models can now be learned directly from videos, typically by combining a template-based surface model with neural representations. However, obtaining a template surface requires expensive multi-view capture systems, laser scans, or strictly controlled conditions. Previous methods avoid using a template but rely on a costly or ill-posed mapping from observation to canonical space. We propose a hybrid point-based representation for reconstructing animatable characters that does not require an explicit surface model, while being generalizable to novel poses. For a given video, our method automatically produces an explicit set of 3D points representing approximate canonical geometry, and learns an articulated deformation model that produces pose-dependent point transformations. The points serve both as a scaffold for high-frequency neural features and as an anchor for efficiently mapping between observation and canonical space. We demonstrate on established benchmarks that our representation overcomes limitations of prior work operating in either canonical or observation space. Moreover, our automatic point extraction approach enables learning models of human and animal characters alike, matching the performance of methods using rigged surface templates despite being more general. Project website: https://lemonatsu.github.io/npc/
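    To make the canonical-to-observation mapping concrete, here is a generic articulated point transformation in the style of linear blend skinning: each canonical point is moved by a weighted blend of per-bone rigid transforms. This is a simplified stand-in, not NPC's learned, pose-dependent deformation model; the random weights and identity rotations are placeholders for illustration.

```python
import numpy as np

def transform_points(points_canon, skin_weights, bone_rotations, bone_translations):
    """Map canonical 3D points to a posed (observation) space.

    points_canon:      (N, 3) canonical point locations.
    skin_weights:      (N, B) per-point weights over B bones, rows sum to 1.
    bone_rotations:    (B, 3, 3) per-bone rotation matrices.
    bone_translations: (B, 3) per-bone translations.
    """
    # Apply every bone transform to every point: result has shape (B, N, 3).
    per_bone = np.einsum('bij,nj->bni', bone_rotations, points_canon) \
        + bone_translations[:, None, :]
    # Blend the per-bone results with the per-point weights.
    return np.einsum('nb,bni->ni', skin_weights, per_bone)

# Illustrative dummy data: 5 points, 2 bones.
rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 3))
w = rng.random(size=(5, 2)); w /= w.sum(axis=1, keepdims=True)
R = np.stack([np.eye(3), np.eye(3)])               # identity rotations for simplicity
t = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])   # translate bone 2 slightly
print(transform_points(pts, w, R, t))
```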

    Masksembles for Uncertainty Estimation

    Deep neural networks have amply demonstrated their prowess, but estimating the reliability of their predictions remains challenging. Deep Ensembles are widely considered one of the best methods for generating uncertainty estimates but are very expensive to train and evaluate. MC-Dropout is another popular alternative, which is less expensive but also less reliable. Our central intuition is that there is a continuous spectrum of ensemble-like models of which MC-Dropout and Deep Ensembles are extreme examples. The first uses an effectively infinite number of highly correlated models while the second relies on a finite number of independent models. To combine the benefits of both, we introduce Masksembles. Instead of randomly dropping parts of the network as in MC-Dropout, Masksembles relies on a fixed number of binary masks, which are parameterized in a way that allows the correlations between individual models to be controlled. Namely, by controlling the overlap between the masks and their density, one can choose the optimal configuration for the task at hand. This leads to a simple, easy-to-implement method with performance on par with Deep Ensembles at a fraction of the cost. We experimentally validate Masksembles on two widely used datasets, CIFAR10 and ImageNet.
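    To make the masking idea concrete, here is a minimal sketch: a fixed set of binary channel masks is generated once, each mask defines one ensemble member, and predictions are averaged across members. The random mask construction and the plain linear head below are simplifying assumptions; the paper parameterizes mask overlap and density more carefully to interpolate between MC-Dropout and Deep Ensembles.

```python
import numpy as np

def make_masks(n_masks, n_channels, density, seed=0):
    """Create a fixed set of binary channel masks.

    density controls how many channels each mask keeps; sparser, less
    overlapping masks behave more like independent ensemble members,
    denser, heavily overlapping ones behave more like MC-Dropout.
    """
    rng = np.random.default_rng(seed)
    n_keep = int(round(density * n_channels))
    masks = np.zeros((n_masks, n_channels))
    for m in range(n_masks):
        keep = rng.choice(n_channels, size=n_keep, replace=False)
        masks[m, keep] = 1.0
    return masks

def masked_forward(features, weights, mask):
    """One ensemble member: mask the features, then apply a linear head."""
    return (features * mask) @ weights

def predict_with_uncertainty(features, weights, masks):
    """Average the members' predictions; their spread is the uncertainty signal."""
    preds = np.stack([masked_forward(features, weights, m) for m in masks])
    return preds.mean(axis=0), preds.std(axis=0)

# Dummy example: 16-dim features, 3 outputs, 4 masks keeping 75% of channels.
rng = np.random.default_rng(1)
x = rng.normal(size=(16,))
W = rng.normal(size=(16, 3))
masks = make_masks(n_masks=4, n_channels=16, density=0.75)
mean, std = predict_with_uncertainty(x, W, masks)
print(mean, std)
```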

    Drivable 3D Gaussian Avatars

    We present Drivable 3D Gaussian Avatars (D3GA), the first 3D controllable model for human bodies rendered with Gaussian splats. Current photorealistic drivable avatars require either accurate 3D registrations during training, dense input images during testing, or both. The ones based on neural radiance fields also tend to be prohibitively slow for telepresence applications. This work uses the recently presented 3D Gaussian Splatting (3DGS) technique to render realistic humans at real-time framerates, using dense calibrated multi-view videos as input. To deform those primitives, we depart from the commonly used point deformation method of linear blend skinning (LBS) and use a classic volumetric deformation method: cage deformations. Given their smaller size, we drive these deformations with joint angles and keypoints, which are more suitable for communication applications. Our experiments on nine subjects with varied body shapes, clothes, and motions yield higher-quality results than state-of-the-art methods when using the same training and test data. Website: https://zielon.github.io/d3ga
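    Cage deformation, in its generic form, moves every point as a weighted combination of cage vertices, with the weights computed once from the rest configuration. The sketch below uses a 2D square cage and crude inverse-distance weights purely for illustration; a real system would use proper cage coordinates (e.g., generalized barycentric coordinates) and 3D cages rather than this stand-in.

```python
import numpy as np

def cage_weights(points, cage_rest, eps=1e-8):
    """Per-point weights over cage vertices, computed in the rest pose.

    Inverse-distance weights are used here only for illustration; actual
    cage-deformation systems use generalized barycentric coordinates.
    """
    d = np.linalg.norm(points[:, None, :] - cage_rest[None, :, :], axis=-1)
    w = 1.0 / (d + eps)
    return w / w.sum(axis=1, keepdims=True)

def deform(points, cage_rest, cage_deformed):
    """Move points by re-expressing them over the displaced cage vertices."""
    w = cage_weights(points, cage_rest)
    return w @ cage_deformed

# A unit-square cage around a few 2D points; pull one corner outward.
cage_rest = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
cage_def  = np.array([[0.0, 0.0], [1.0, 0.0], [1.5, 1.5], [0.0, 1.0]])
pts = np.array([[0.5, 0.5], [0.9, 0.9]])
print(deform(pts, cage_rest, cage_def))
```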

    Modeling Facial Geometry using Compositional VAEs

    We propose a method for learning non-linear face geometry representations using deep generative models. Our model is a variational autoencoder with multiple levels of hidden variables, where lower layers capture global geometry and higher ones encode more local deformations. Based on that, we propose a new parameterization of facial geometry that naturally decomposes the structure of the human face into a set of semantically meaningful levels of detail. This parameterization enables us to do model fitting while capturing varying levels of detail under different types of geometrical constraints.
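    A toy sketch of the compositional idea follows: a global latent code produces a coarse face shape, and per-region local codes add higher-frequency displacements restricted to their regions. The additive composition, random weights, and region assignment here are illustrative assumptions, not the paper's trained architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N_VERTS, D_GLOBAL, D_LOCAL, N_REGIONS = 100, 8, 4, 5

# Random linear "decoders" standing in for trained networks.
W_global = rng.normal(size=(D_GLOBAL, N_VERTS * 3)) * 0.1
W_local = rng.normal(size=(N_REGIONS, D_LOCAL, N_VERTS * 3)) * 0.02
# Assign each vertex to one region: (N_REGIONS, N_VERTS) boolean mask.
region_of_vertex = rng.integers(0, N_REGIONS, size=N_VERTS)
region_mask = (region_of_vertex[:, None] == np.arange(N_REGIONS)).T

def decode(z_global, z_locals):
    """Coarse-to-fine composition: the global code sets the overall shape,
    local codes add region-restricted, higher-frequency displacements."""
    verts = (z_global @ W_global).reshape(N_VERTS, 3)
    for r in range(N_REGIONS):
        local = (z_locals[r] @ W_local[r]).reshape(N_VERTS, 3)
        verts += local * region_mask[r][:, None]
    return verts

z_g = rng.normal(size=D_GLOBAL)
z_l = rng.normal(size=(N_REGIONS, D_LOCAL))
print(decode(z_g, z_l).shape)  # (100, 3)
```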

    Dressing Avatars: Deep Photorealistic Appearance for Physically Simulated Clothing

    Despite recent progress in developing animatable full-body avatars, realistic modeling of clothing - one of the core aspects of human self-expression - remains an open challenge. State-of-the-art physical simulation methods can generate realistically behaving clothing geometry at interactive rates. Modeling photorealistic appearance, however, usually requires physically-based rendering, which is too expensive for interactive applications. On the other hand, data-driven deep appearance models are capable of efficiently producing realistic appearance, but struggle at synthesizing geometry of highly dynamic clothing and handling challenging body-clothing configurations. To this end, we introduce pose-driven avatars with explicit modeling of clothing that exhibit both photorealistic appearance learned from real-world data and realistic clothing dynamics. The key idea is to introduce a neural clothing appearance model that operates on top of explicit geometry: at training time we use high-fidelity tracking, whereas at animation time we rely on physically simulated geometry. Our core contribution is a physically-inspired appearance network, capable of generating photorealistic appearance with view-dependent and dynamic shadowing effects even for unseen body-clothing configurations. We conduct a thorough evaluation of our model and demonstrate diverse animation results on several subjects and different types of clothing. Unlike previous work on photorealistic full-body avatars, our approach can produce much richer dynamics and more realistic deformations even for many examples of loose clothing. We also demonstrate that our formulation naturally allows clothing to be used with avatars of different people while staying fully animatable, thus enabling, for the first time, photorealistic avatars with novel clothing. SIGGRAPH Asia 2022 (ACM ToG) camera-ready version; the supplementary video can be found at https://research.facebook.com/publications/dressing-avatars-deep-photorealistic-appearance-for-physically-simulated-clothing